This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. You can also view the current state of the notebook as a well-formatted HTML document, if you click preview just above this document.
Try executing a chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter. You can also run the next chunk of code via Cmd+Alt+`. (These commands are valid on Mac, might differ on Windows)
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed. If you wish to get a full-fledged HTML output (that includes the interactive plots inside the document, rather than in seperate viewer windows), choose Knit to HTML instead of the Preview Notebook button.
In R (and many other programming language) reusable code written by others is distributed via libraries. In order to use those pieces of code, we first need to install the packages (this has been done previously for all users, as it takes 2+ hours to install all required packages), then we need to load them within our current session.
# Load libraries
library(magrittr) # Pipe %>% operation for clean coding
library(SingleCellExperiment) # Data container (https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html)
library(pcaMethods) # Linear DimRed methods (PCA and extensions)
library(RDRToolbox) # Non-linear DimRed methods (LLE, Isomap)
library(ggbiplot)
library(plotly)
library(sparsepca)
library(scran) # scRNA-seq methods, By Aaron Lun, Cambridge researcher
library(tidyverse)
library(printr)
cat("Libraries successfully loaded\n")
Libraries successfully loaded
# Read the data
data_zeisel <- readRDS("zeisel.rds")
# Get basic information of the data
print(data_zeisel)
class: SingleCellExperiment
dim: 19972 3005
metadata(0):
assays(2): counts logcounts
rownames(19972): Tspan12 Tshz1 ... Gm21943_loc3 Gm20738_loc3
rowData names(10): feature_symbol is_feature_control ... total_counts
log10_total_counts
colnames(3005): X1 X1.1 ... X9.58 X9.59
colData names(30): clust_id cell_type1 ... pct_counts_ERCC is_cell_control
reducedDimNames(0):
spikeNames(1): ERCC
# Gene statistics
print(as_tibble(rowData(data_zeisel), rownames="gene_id"))
# Cell statistics
print(as_tibble(colData(data_zeisel), rownames="cell_id"))
The easiest way to reduce the amount of our data is to select only a subset of the recorded genes to analyse.
Before we remove anything, we want to know in the end, how much information we are removing from the data by the subsetting of genes. To do this, we should first know, how much information is in the data. An easy proxy of “information” that we’ll use today is the amount of Variance in the data.
# In order to compute sample-wise variance, we need to center the data
# (Column-wise, so basically find the mean of all data and substract that from each data point to serve as the new origin)
center_colmeans <- function(X) {
Xmean = colMeans(X)
X - rep(Xmean, rep.int(nrow(X), ncol(X)))
}
# Get total variance as the trace of the covariance matrix (but to save computation, we can rotate elements in trace as Tr[X X^T] = Tr[X^T X], then realise that this is just = sum(colSums(Xc^2)) )
get_total_matrix_variance <- function(X){
Xc <- center_colmeans(X)
sum(colSums(Xc^2))
}
# Write a convenience function that directly computes the variance in our SCE data
get_total_data_variance <- function(sce_data, logcount = TRUE){
if (logcount){
sce_data %>%
logcounts() %>%
get_total_matrix_variance() ->
out
} else {
sce_data %>%
counts() %>%
get_total_matrix_variance() ->
out
}
# Return out
out
}
# Get total variance in the (log counts) data
total_logcounts_variance <- get_total_data_variance(data_zeisel)
sprintf(paste0(
"The total variance in the logcounts data is ",
get_total_data_variance(data_zeisel)
))
[1] "The total variance in the logcounts data is 36490381.8458941"
We may then remove genes that have very low total counts in the data:
# We first just remove the genes that have to small total counts across the whole dataset.
min_total_count_per_gene = 25
data_zeisel %>%
counts() %>%
rowSums() ->
tmp_gene_total_counts
data_zeisel[
tmp_gene_total_counts >= min_total_count_per_gene,
] ->
data_zeisel_lowcount
# Of course let's check how much information we still have:
sprintf(paste0(
"After removing low count genes, we retain ",
format(
get_total_data_variance(data_zeisel_lowcount) / total_logcounts_variance *100,
digits=3
),
"%% of the variance."
))
[1] "After removing low count genes, we retain 94.5% of the variance."
sprintf(paste0(
"We now have ", nrow(data_zeisel_lowcount), " / ", nrow(data_zeisel), " genes remain"
))
[1] "We now have 15251 / 19972 genes remain"
Next we check from the remaining genes, which one on average have the most variation in fold-changes across the different cells. As our scientific goal currently is to find a way to characterise how cells differ from one-another, this is a reasonable thing to do.
how_many_genes_to_keep = 500
# Next we check from the remaining genes, which one on average have the most fold_changes
data_zeisel_lowcount %>%
logcounts() %>% # logcounts assay
rowVars ->
gene_count_variances # save row-wise variances
# Add back the names (unfortunately rowVars deletets them)
names(gene_count_variances) <- rownames(data_zeisel_lowcount)
# Do the subsetting based on top N variances (by sorted gene name)
data_zeisel_lowcount[
names((gene_count_variances %>% sort(decreasing = TRUE))[1:how_many_genes_to_keep]),
] ->
data_zeisel_topgenes
# Of course let's check how much information we still have:
sprintf(paste0(
"After keeping only the top ", how_many_genes_to_keep, " genes, we retain ",
format(
get_total_data_variance(data_zeisel_topgenes) / total_logcounts_variance *100,
digits=3
),
"%% of the variance."
))
[1] "After keeping only the top 500 genes, we retain 12.1% of the variance."
# # Let's take a look at our expression heatmap
# data_zeisel_topgenes %>%
# scater::plotHeatmap(
# features = 1:dim(.)[1],
# cluster_rows = FALSE,
# cluster_cols = FALSE
# )
Now we may wish to do some more advanced dimensionality reduction, that takes into account not just properties of single genes, but also how they interact. Principal Component Analysis (PCA) is one such methods, it attempts to reduce the dimensionality of the data not just by getting rid of single genes, but by finding linear combinations of gene expression patterns that are informative, and keep most of the variance intact.
# Run PCA on the data
results_prcomp <- prcomp(
data_zeisel_topgenes %>% logcounts() %>% t() # Tranpose is necessary because base R algorithms think in data frames, where samples are in rows, and features are in columns
)
plot_pca_result(results_prcomp, interactive = FALSE)
Now what have we learned, and was this helpful? Not really, interactive mode can point out outliers maybe (although PCA is not very robust to them). And what does PC1 and PC2 really mean?
We can look at what PC1 is:
as.data.frame(tibble::enframe(results_prcomp$rotation[,i][order(abs(results_prcomp$rotation[,i]), decreasing = TRUE)], value="weight"))
name weight
1 Plp1 0.1512231585
2 Trf 0.1184318670
3 Mal 0.1119303119
4 Apod 0.0991299868
5 Mog 0.0988120846
6 Car2 0.0954679536
7 Mbp 0.0937816511
8 Atp1b1 -0.0927071110
9 Cnp 0.0926628445
10 Ugt8a 0.0908801969
11 Snap25 -0.0868271964
12 Rtn1 -0.0853471471
13 Snhg11 -0.0853076663
14 Mobp 0.0852174670
15 Stmn3 -0.0851872221
16 Meg3 -0.0850614524
17 Ermn 0.0847810312
18 Enpp2 0.0844830955
19 Chn1 -0.0829452467
20 Mag 0.0808153672
21 Syt1 -0.0805356154
22 Cryab 0.0772241962
23 Snca -0.0771291675
24 Hpca -0.0765301504
25 Gria2 -0.0764177965
26 Tspan2 0.0755472332
27 Qk 0.0751080097
28 Gpm6a -0.0750386341
29 Cldn11 0.0750383274
30 Ppp3ca -0.0746356622
31 Cd9 0.0745513224
32 Atp2b1 -0.0737975062
33 Prkcb -0.0733255014
34 Nsf -0.0732832192
35 Basp1 -0.0732304888
36 Stmn2 -0.0731004359
37 Tubb2a -0.0727677277
38 Atp1a3 -0.0717133727
39 Ndrg4 -0.0714872184
40 Sept4 0.0707675484
41 Gsn 0.0695018571
42 Ywhah -0.0683160904
43 Dbi 0.0682783109
44 Gatm 0.0682077558
45 Scn2a1 -0.0673519326
46 Cmtm5 0.0665534230
47 Eno2 -0.0664872905
48 Grin2b -0.0664800573
49 Npc2 0.0662747616
50 Grm5 -0.0660794655
51 Ttc3 -0.0658067338
52 Qdpr 0.0652170197
53 Olfm1 -0.0649159001
54 Csrp1 0.0648917366
55 Nrgn -0.0648546351
56 Scg5 -0.0648122473
57 Gria1 -0.0646826834
58 Scd2 0.0646723108
59 Aspa 0.0635589232
60 Dlgap1 -0.0633828874
61 Thy1 -0.0630788533
62 Vsnl1 -0.0630204690
63 3110035E14Rik -0.0628254239
64 Tmem88b 0.0627953368
65 Ptgds 0.0625596742
66 Cpe -0.0622721679
67 Celf4 -0.0621523198
68 Tspan13 -0.0620052479
69 Syn2 -0.0618216797
70 Map1b -0.0617969243
71 Cck -0.0615945720
72 Ncdn -0.0609322847
73 Calm2 -0.0609191362
74 Opalin 0.0607256572
75 Ppp3r1 -0.0604381559
76 Pdlim2 0.0602987901
77 Napb -0.0599431118
78 Pcsk2 -0.0598603917
79 Cyfip2 -0.0595898481
80 Gap43 -0.0595753460
81 Dnm1 -0.0592848563
82 Gabra1 -0.0586533665
83 Uchl1 -0.0585312001
84 Nsg2 -0.0583097011
85 D3Bwg0562e -0.0582882344
86 Camk2b -0.0582699060
87 Slc8a1 -0.0581316051
88 Rnf13 0.0578877620
89 Rasgrp1 -0.0573868949
90 Ppp3cb -0.0571065831
91 Pja2 -0.0566133130
92 Psat1 0.0564942606
93 Ywhag -0.0564470492
94 Arf3 -0.0564274022
95 Cd81 0.0563185759
96 Celf2 -0.0563176087
97 Mllt11 -0.0557743640
98 Syp -0.0554515145
99 Pgm2l1 -0.0553307240
100 Grb14 0.0550364565
101 2900097C17Rik -0.0548232666
102 Kif5c -0.0546580146
103 Zcchc18 -0.0546222514
104 Pde1a -0.0543013065
105 Cx3cl1 -0.0541693367
106 Atp6v1g2 -0.0540465925
107 Prkar1b -0.0535963529
108 Tuba4a -0.0533701293
109 Glul 0.0533471101
110 Camk2a -0.0532664560
111 Camkv -0.0531537313
112 Rab3a -0.0530295619
113 Gabrb2 -0.0529285559
114 Phgdh 0.0526309896
115 Rbfox1 -0.0525520726
116 Snurf -0.0525338267
117 Mapk10 -0.0525098527
118 Neurod6 -0.0524221383
119 Cers2 0.0523790131
120 Pllp 0.0523475520
121 Gda -0.0521827446
122 Nptn -0.0520983115
123 Myt1l -0.0520870534
124 Gpr37 0.0519681311
125 Sepp1 0.0514071017
126 Slc12a2 0.0507642173
127 Prdx1 0.0507496979
128 Cpne6 -0.0507028278
129 Epha4 -0.0506862929
130 Fbxw7 -0.0506319256
131 Dbndd2 0.0504558601
132 Crym -0.0502029415
133 Mdh1 -0.0494478898
134 Ngfrap1 -0.0494437311
135 Atp6v0c-ps2 -0.0494388685
136 Taldo1 0.0493722719
137 Wasf1 -0.0491817144
138 Fam131a -0.0489454328
139 Gabrg2 -0.0489247762
140 Ywhaz -0.0486943091
141 Ptprn -0.0486062773
142 Rbfox3 -0.0485890538
143 Calm3 -0.0485730845
144 Aldoa -0.0483798589
145 Gria3 -0.0483568553
146 Elovl7 0.0482904341
147 Synj1 -0.0481565343
148 Nrxn1 -0.0480623880
149 Ensa -0.0480525767
150 Sc4mol 0.0479714658
151 Dclk1 -0.0479622603
152 Celf5 -0.0478839029
153 Syt4 -0.0476483340
154 Brinp1 -0.0475993711
155 Slc12a5 -0.0475082966
156 Slc38a2 0.0474808931
157 Chgb -0.0474575605
158 Sv2b -0.0472659590
159 Elmod1 -0.0472202969
160 Ywhab -0.0468300004
161 Atp2b2 -0.0467308753
162 2810468N07Rik 0.0466799870
163 Prkce -0.0465873218
164 Ociad2 -0.0465089745
165 Rab6b -0.0465061201
166 Cplx2 -0.0463002141
167 Ntm -0.0461266208
168 Plekhb1 0.0460448384
169 Maged1 -0.0458650137
170 Chl1 -0.0458469367
171 Atp6v1b2 -0.0455643746
172 Litaf 0.0455442211
173 Scg2 -0.0455396461
174 Syn1 -0.0453830808
175 Gng11 0.0453439649
176 Caly -0.0451921630
177 Rab3c -0.0450917523
178 Nell2 -0.0449689558
179 Ptprs -0.0448969629
180 Cdk5r1 -0.0447913693
181 Ptn 0.0446952291
182 Kalrn -0.0444743174
183 Sptan1 -0.0441070509
184 Evi2a-evi2b 0.0440892611
185 Tenm2 -0.0440533499
186 Snap91 -0.0439981760
187 Got1 -0.0438925457
188 Scn8a -0.0438730190
189 Ctxn1 -0.0438502903
190 Hmgcs1 0.0437288126
191 Lmo4 -0.0435216091
192 Gprasp1 -0.0435103715
193 Ptk2b -0.0433335751
194 Serpini1 -0.0431802040
195 Gpr22 -0.0431435281
196 Ndrg3 -0.0431336154
197 Arpp21 -0.0430369869
198 Zbtb18 -0.0429954556
199 Ppm1e -0.0428969110
200 Pld3 -0.0428733492
201 Nrn1 -0.0428579352
202 Cacna2d1 -0.0428296996
203 Tpi1 -0.0427684050
204 Cadm2 -0.0427190994
205 Pfn2 -0.0427116546
206 Stxbp1 -0.0426794041
207 Enc1 -0.0425854793
208 Atp2a2 -0.0425496999
209 Sh3gl2 -0.0424509160
210 Kcnd2 -0.0424075365
211 Olig1 0.0423191030
212 Ndrg1 0.0422824238
213 Pkm -0.0422624106
214 Sypl 0.0422417048
215 Reep3 0.0422159513
216 Tubb3 -0.0421805676
217 Ftl1 0.0421018533
218 2010300C02Rik -0.0420940218
219 Ldha -0.0420630656
220 Cyp51 0.0420126899
221 Gnas -0.0419969767
222 Gls -0.0419664334
223 Plk2 -0.0419307678
224 Nrxn3 -0.0419161435
225 Rgs4 -0.0416645522
226 Frrs1l -0.0415617046
227 Fabp5 0.0415396141
228 Camk2n1 -0.0415024241
229 Dgkg -0.0414756049
230 Atp6v1d -0.0414710915
231 Gjc3 0.0414326130
232 Atp6v1a -0.0413493097
233 Pgam1 -0.0413116182
234 Jam3 0.0410812858
235 Mtpn -0.0410730458
236 Kcnma1 -0.0409681496
237 Phldb1 0.0409512632
238 Mapk1 -0.0409205846
239 Bcl11b -0.0406004566
240 Kifap3 -0.0405706584
241 Itm2c -0.0405534298
242 Ahi1 -0.0405162447
243 Cnksr2 -0.0404707021
244 Sirt2 0.0404195013
245 Baiap2 -0.0404171252
246 Car14 0.0403645546
247 Rgs7bp -0.0403299095
248 Pgrmc1 -0.0402630230
249 Camta1 -0.0401132555
250 Sub1 -0.0400926961
251 Mt1 0.0400243836
252 Nov -0.0400221818
253 Dkk3 -0.0397158577
254 Slc48a1 0.0394878227
255 Rnf7 0.0391880301
256 Hsph1 -0.0390343487
257 Clic4 0.0387717705
258 Cmip -0.0387368877
259 Abr -0.0387214901
260 Gng5 0.0387148112
261 Ptma 0.0382611287
262 Rab6a -0.0382475033
263 Snap47 -0.0381483674
264 Tagln3 -0.0378067597
265 Tmeff2 0.0376942810
266 Nbea -0.0376460950
267 Npcd -0.0375058877
268 Cntn1 -0.0374838201
269 Kcnb1 -0.0371608676
270 Ddr1 0.0368337167
271 Cpne7 -0.0366024950
272 Fermt2 0.0364664002
273 Pebp1 -0.0364405455
274 Tmx4 -0.0363428485
275 S100b 0.0361816271
276 Cd63 0.0360047246
277 Idh3b -0.0359545133
278 Atpif1 -0.0359526163
279 Nme1 -0.0358935290
280 Zwint -0.0358505514
281 Wfs1 -0.0358335168
282 Slc25a3 -0.0356908168
283 Adcy1 -0.0355791795
284 Atp6v0d1 -0.0354647792
285 Atp1a2 0.0353791611
286 Miat -0.0353416561
287 Lamp2 0.0352614213
288 Bpgm 0.0351809175
289 Strbp -0.0350888357
290 S100a1 0.0350692795
291 Jun 0.0349345780
292 Chd3 -0.0349193740
293 Sorl1 -0.0348763799
294 Hprt -0.0348128948
295 Pfkp -0.0347275499
296 Dpy19l1 0.0345897250
297 Purb -0.0345548391
298 Ctsl 0.0345245541
299 Actr2 -0.0342891750
300 Mycbp2 -0.0341720906
301 Gnao1 -0.0340223115
302 Eif4a2 -0.0338912026
303 Atp5g1 -0.0338571570
304 Foxp1 -0.0335223976
305 Atp5b -0.0333563369
306 Rtn3 -0.0331263310
307 Degs1 0.0331071477
308 Arf1 -0.0328521406
309 Gm5506_loc1 -0.0325597546
310 Fth1 0.0325498729
311 Fos 0.0322522713
312 Fkbp3 -0.0322392160
313 Cadm3 -0.0321086889
314 1810037I17Rik 0.0319324290
315 Gp1bb -0.0319053548
316 Myl12b -0.0315982804
317 Prdx6 0.0314306838
318 Tspan7 -0.0313931474
319 Slc22a17 -0.0313699302
320 Apoe 0.0313442928
321 Prnp -0.0312669689
322 G3bp2 -0.0309682491
323 Atp6v1e1 -0.0308307440
324 Cplx1 -0.0307952131
325 Sptbn1 -0.0306580123
326 Cadm1 -0.0305719494
327 Ntrk2 -0.0304816949
328 Tcf4 -0.0303330992
329 Tsc22d1 -0.0303083772
330 Dynll2 -0.0301432939
331 Atp1b3 0.0300243036
332 Gstp1 0.0299448994
333 Ryr2 -0.0298833966
334 Usmg5 -0.0298498387
335 Trim2 -0.0297801245
336 Sparc 0.0297119223
337 Cttnbp2 -0.0294919603
338 Cnr1 -0.0290362603
339 Aatk 0.0290271034
340 Ptprd 0.0290240917
341 Slc1a3 0.0288899416
342 Tubb4a 0.0285761394
343 Hspa8 -0.0284383833
344 Alcam -0.0284081818
345 Cntn2 0.0283935376
346 Atrx -0.0280128494
347 Tubb5 -0.0276575303
348 Tuba1b -0.0275957265
349 Kif5b -0.0274051474
350 Tmod2 -0.0273414149
351 Mgst3 -0.0271767650
352 Cox7b -0.0271746624
353 Pcdh9 0.0269979876
354 Pls3 -0.0263914902
355 Slc25a4 -0.0263852952
356 Timp2 -0.0262615271
357 Tmsb10 -0.0261928223
358 Nrip3 -0.0261564714
359 Nnat -0.0261490593
360 1500004A13Rik 0.0261098136
361 Ndufa5 -0.0257939939
362 Ndufa4 -0.0252504077
363 Vmp1 0.0248372348
364 Cltc -0.0245993720
365 Slc24a2 -0.0244807357
366 Gstm5 0.0244175284
367 Ank2 -0.0244003473
368 Efnb3 0.0242479827
369 Rps20 0.0233663548
370 H3f3a 0.0233204344
371 Rpl7 0.0230251838
372 Ghitm -0.0229452321
373 Sat1 0.0228016035
374 Gm9846 0.0227920528
375 Ppap2b 0.0226466272
376 Dusp1 0.0225981751
377 Rps3 0.0225231343
378 Nfib -0.0223714513
379 Nap1l5 -0.0223112226
380 Dner -0.0220989694
381 Anks1b -0.0220467127
382 Kcnq1ot1 -0.0219293649
383 Fosb 0.0216528176
384 Gnai2 0.0214823086
385 Anxa5 0.0213896693
386 Secisbp2l 0.0210364166
387 Fkbp1a -0.0209442042
388 Cacnb4 -0.0209335898
389 Rps29 0.0208236413
390 Zfp36 0.0207796164
391 Morf4l2 -0.0207389994
392 Deb1 0.0206252529
393 Rps6 0.0206136313
394 Rps15a-ps4 0.0201819675
395 Dynll1 -0.0200864586
396 Rps24 0.0200615914
397 Kif1a -0.0197486490
398 Mfge8 0.0192450801
399 Pcp4 -0.0191849820
400 Nfasc 0.0190847374
401 Klf4 0.0189028215
402 Gpr37l1 0.0188821047
403 Id3 0.0187836638
404 Rpl34-ps1 0.0185598920
405 Ndufa1 -0.0185193873
406 Ldhb -0.0181796944
407 Sparcl1 -0.0181717998
408 Gm6402 0.0180548472
409 Cycs -0.0179566183
410 Glo1 0.0173871313
411 Mt2 0.0170529641
412 Mef2c -0.0169067624
413 B2m 0.0168546882
414 Rpl39 0.0165978355
415 Gpm6b 0.0165464299
416 Gstm1 0.0164136387
417 Ywhaq 0.0159843975
418 Rps18 0.0159232472
419 Slco1c1 0.0158574434
420 Omg 0.0157975236
421 Gja1 0.0157616906
422 Atp5o -0.0157050934
423 Kcna1 0.0154814495
424 Xist 0.0154764626
425 Luc7l3 -0.0154628511
426 Marcks -0.0154566344
427 Pea15a 0.0153910229
428 Btg2 0.0152158196
429 Dpysl2 -0.0151938131
430 Rps3a1 0.0151286709
431 Gm15421 -0.0151181240
432 Zfp36l1 0.0150688055
433 Cyr61 0.0149714074
434 Nfkbia 0.0148377995
435 Atp5k -0.0147365161
436 Ly6e -0.0146758061
437 Rpl32 0.0144245661
438 Itm2a 0.0143783961
439 Pcdh17 -0.0142548806
440 Ly6c1 0.0141353087
441 2900060B14Rik -0.0140206980
442 Atp5j2 -0.0139887152
443 Pla2g7 0.0138932220
444 Ubc 0.0135401507
445 Stmn4 0.0133879068
446 Mmd2 0.0129128004
447 Aqp4 0.0127933883
448 Ndrg2 0.0126087113
449 Zeb2 -0.0126017782
450 Bcan 0.0124597973
451 Pltp 0.0123852409
452 Ndufc1 -0.0123198216
453 Sh3bgrl3 -0.0120811567
454 Slco1a4 0.0120673511
455 Gad1 -0.0119723430
456 Gad2 -0.0119687650
457 Pisd-ps3 -0.0113540619
458 Bsg -0.0111193500
459 Sumo2 -0.0110710704
460 Cox7a2 -0.0110414553
461 Slc4a4 0.0109365063
462 Srsf11 -0.0108817401
463 Rpl31-ps12 0.0106882957
464 Aldoc 0.0104227025
465 Mif -0.0102354949
466 Srrm2 -0.0101836074
467 Acot7 -0.0101666357
468 Scg3 -0.0098045808
469 Atp5f1 -0.0097006093
470 2010107E04Rik -0.0094175915
471 Htr3a -0.0093346158
472 Id2 -0.0092830864
473 Aplp1 0.0088738078
474 Cox6b1 -0.0087876541
475 Atp5l -0.0087052693
476 Clu -0.0084545823
477 Cst3 0.0078706415
478 Hes1 0.0072921032
479 Acta2 0.0068004040
480 Ncald -0.0065495233
481 Tecr 0.0061355291
482 Atp5j -0.0058849356
483 Npy -0.0057826019
484 Syt11 0.0055345134
485 Ndufb4 -0.0054895872
486 Prkacb -0.0051883092
487 Ckb 0.0050882670
488 Egr1 -0.0037537489
489 Lars2 0.0037112272
490 Vip -0.0035437100
491 Sst -0.0035297699
492 Uqcrh 0.0033672429
493 Slc1a2 -0.0029527647
494 Osbpl1a 0.0029010677
495 Atp1b2 -0.0024692890
496 Tmsb4x -0.0020913769
497 Serpine2 0.0020746148
498 Tuba1a 0.0020134896
499 Efr3b 0.0010979011
500 Slc6a1 0.0008816388
Still not terribly helpful, but this represents some linear combination of (log) gene expressions. It actually means an equation for a single line in the space of all possible expression patterns. For a hypotethical “cell_x”, that has expression x_
\[ \textrm{PC1} ( \textrm{cell_x} ) = 0.15 * x_\textrm{Plp1} + 0.11 * x_\textrm{Trf} - 0.08 * x_\textrm{Snap25} + \dots + 0.0008 * x_\textrm{Slc6a1} \\ = \vec{w}_1 \cdot \vec{x}\]
Ok so this means that when some of those genes in the equation are upregulated together, and some others are down-regulated at the same time. That’s great, I can go test that?! No, not really, this equation involves ALL measured genes, so it is not particularly useful to create new hypothesis.
What can we do about that?
```
First, let's look at what PCA actually does!
(Continued on the board)
```
Now with our extra penalty term for using too many genes to explain variance, we can make a trade-off between how much variance our PC1 explains, versus how many genes the \(w1\) vector involves to compute that PC1.